Add Llama NVFP4 PTQ recipes and MLP-only FP8-cast preset.#1645

Closed
kiranbeethoju wants to merge 1 commit into NVIDIA:main from kiranbeethoju:feat/llama-nvfp4-ptq-recipes

@kiranbeethoju kiranbeethoju commented Jun 6, 2026

Expose huggingface/llama/ptq paths for partial and full NVFP4 on Llama 3.x, add the missing general nvfp4_mlp_only-kv_fp8_cast recipe, and cover loading in unit tests so recipe validation runs on CPU-only hosts.
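As a rough illustration of why this kind of validation needs no GPU, the sketch below only checks that the three recipe files named in this PR exist under a recipes root. It is a simplification under stated assumptions: the helper name `missing_recipes` is hypothetical, and the real unit test goes through the recipe loader rather than a plain file check.

```python
from pathlib import Path

# Recipe paths named in this PR, relative to the modelopt_recipes directory.
NEW_RECIPES = [
    "general/ptq/nvfp4_mlp_only-kv_fp8_cast.yaml",
    "huggingface/llama/ptq/nvfp4_default-kv_fp8_cast.yaml",
    "huggingface/llama/ptq/nvfp4_mlp_only-kv_fp8_cast.yaml",
]


def missing_recipes(recipes_root, relative_paths=NEW_RECIPES):
    """Return the recipe paths that are not files under recipes_root.

    Hypothetical helper for illustration; it touches only the filesystem,
    so it runs on CPU-only hosts.
    """
    root = Path(recipes_root)
    return [rel for rel in relative_paths if not (root / rel).is_file()]
```

Run from a repository checkout, `missing_recipes("modelopt_recipes")` would return an empty list when all three files are present.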

What does this PR do?

Type of change: ?

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A
  • Did you get Claude approval on this PR?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

Release Notes

  • New Features

    • Added NVFP4 quantization recipes with FP8 KV-cache casting for Llama models
    • Introduced MLP-only NVFP4 variant for selective layer quantization
  • Documentation

    • Updated recipe selection guides with Llama 3.x NVFP4 configuration examples
    • Added comprehensive documentation for Llama PTQ recipes, including hardware requirements and KV-calibration guidance

Expose huggingface/llama/ptq paths for partial and full NVFP4 on Llama 3.x,
add the missing general nvfp4_mlp_only-kv_fp8_cast recipe, and cover loading
in unit tests so recipe validation runs on CPU-only hosts.

Signed-off-by: kiranbeethoju <kiranbeethoju@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@kiranbeethoju kiranbeethoju requested review from a team as code owners June 6, 2026 19:04
@kiranbeethoju kiranbeethoju requested a review from cjluo-nv June 6, 2026 19:04

copy-pr-bot Bot commented Jun 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai Bot commented Jun 6, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ff7f9f07-13ef-438a-a304-6196f54c5394

📥 Commits

Reviewing files that changed from the base of the PR and between 52f1ccb and 41d56ef.

📒 Files selected for processing (7)
  • examples/llm_ptq/README.md
  • modelopt_recipes/general/ptq/nvfp4_mlp_only-kv_fp8_cast.yaml
  • modelopt_recipes/huggingface/README.md
  • modelopt_recipes/huggingface/llama/ptq/README.md
  • modelopt_recipes/huggingface/llama/ptq/nvfp4_default-kv_fp8_cast.yaml
  • modelopt_recipes/huggingface/llama/ptq/nvfp4_mlp_only-kv_fp8_cast.yaml
  • tests/unit/recipe/test_loader.py

Walkthrough

This PR introduces new NVFP4 PTQ recipe configurations for Llama models, with both general and Hugging Face–specific variants combining MLP-only and default W4A4 quantization strategies with FP8 KV-cache casting, along with documentation and test coverage.

Changes

NVFP4 PTQ Recipes and Documentation

| Layer / File(s) | Summary |
| --- | --- |
| General NVFP4 MLP-only recipe with FP8 KV cast (modelopt_recipes/general/ptq/nvfp4_mlp_only-kv_fp8_cast.yaml) | New general PTQ recipe that applies NVFP4 quantization only to MLP/MoE weight and input quantizers via pattern matching, uses max-based quantization, and composes FP8 KV-cache casting with disabled quantizer configurations. |
| Hugging Face Llama PTQ recipe variants (modelopt_recipes/huggingface/llama/ptq/nvfp4_default-kv_fp8_cast.yaml, modelopt_recipes/huggingface/llama/ptq/nvfp4_mlp_only-kv_fp8_cast.yaml) | Two model-specific recipes for Llama: nvfp4_default-kv_fp8_cast applies W4A4 NVFP4 across all linear layers, and nvfp4_mlp_only-kv_fp8_cast restricts quantization to MLP and MoE layers; both include FP8 KV-cache casting via imported units. |
| Documentation and guidance for recipe selection (examples/llm_ptq/README.md, modelopt_recipes/huggingface/README.md, modelopt_recipes/huggingface/llama/ptq/README.md) | Example README recommends the MLP-only recipe for Llama 3.x NVFP4 quantization; Hugging Face README updated to clarify model-specific recipe discovery; new Llama PTQ README documents recipe variants, KV calibration differences, usage example, and GPU/runtime requirements. |
| Recipe loader smoke-test coverage (tests/unit/recipe/test_loader.py) | Extends _BUILTIN_PTQ_RECIPES test catalog with the new general nvfp4_mlp_only-kv_fp8_cast and two Hugging Face Llama recipe paths to ensure all variants load correctly. |
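
The summary above says the MLP-only recipe selects quantizers "via pattern matching". A minimal sketch of that idea, using shell-style wildcards from the Python standard library, is shown here. The patterns and module names are assumptions chosen for illustration and are not taken from the recipe files in this PR.

```python
from fnmatch import fnmatchcase

# Hypothetical wildcard patterns; the actual recipe's patterns may differ.
MLP_ONLY_PATTERNS = [
    "*mlp*weight_quantizer",
    "*mlp*input_quantizer",
]


def selects_quantizer(quantizer_name, patterns=MLP_ONLY_PATTERNS):
    """Return True if any wildcard pattern matches the quantizer name."""
    return any(fnmatchcase(quantizer_name, pattern) for pattern in patterns)


# Hypothetical quantizer names in the style of a Llama module tree.
names = [
    "model.layers.0.mlp.gate_proj.weight_quantizer",
    "model.layers.0.mlp.gate_proj.input_quantizer",
    "model.layers.0.self_attn.q_proj.weight_quantizer",
]

# Only the MLP quantizers are kept; the attention projection is left alone.
selected = [name for name in names if selects_quantizer(name)]
```

With these assumed patterns, `selected` holds the two `mlp` entries, while the `self_attn` quantizer is skipped, which mirrors the "MLP-only" behavior the summary describes.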

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • NVIDIA/Model-Optimizer#1525: Adds Hugging Face NVFP4 PTQ recipes that rely on kv_fp8_cast KV-cache casting via use_constant_amax, which aligns with these recipes' FP8 KV-cache casting composition strategy.

Suggested reviewers

  • ChenhanYu
  • jenchen13
  • yueshen2016
🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main changes: adding Llama NVFP4 PTQ recipes and an MLP-only FP8-cast preset, which aligns with the file changes across documentation and recipe configurations. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Security Anti-Patterns | ✅ Passed | PR adds YAML recipes, docs, and test updates. No unsafe torch.load, numpy.load, trust_remote_code, eval/exec, nosec comments, or new dependencies detected. |
